Investigate a Dataset - European Soccer Database - Spain LIGA BBVA

Data Set

European Soccer Database

Overview and Notes

This soccer database comes from Kaggle and is well suited for data analysis and machine learning.
It contains data for soccer matches, players, and teams from several European countries from 2008 to 2016.
This dataset is quite extensive, and we encourage you to read more about it here.
The database is stored in a SQLite database. You can access database files using software like DB Browser.
This dataset will help you get good practice with your SQL joins.
Make sure to look at how the different tables relate to each other.
Some column titles should be self-explanatory, and others you’ll have to look up on Kaggle.

Sources

http://www.numpy.org

https://pandas.pydata.org

http://pandas.pydata.org/pandas-docs/version/0.9.1/visualization.html

https://matplotlib.org/gallery.html

https://seaborn.pydata.org

https://www.udacity.com/course/intro-to-programming-nanodegree--nd000

https://docs.google.com/document/d/e/2PACX-1vTlVmknRRnfy_4eTrjw5hYGaiQim5ctr9naaRd4V9du2B5bxpd8FEH3KtDgp8qVekw7Cj1GLk1IXdZi/pub?embedded=True

https://www.kaggle.com/hugomathien/soccer

https://github.com/georgezoto/Intro-to-Programming-Nanodegree/blob/09dce07d512bf2050f0c5f5570fb207ed9855001/Specialization/Data%20Analysis/Investigate%20a%20Dataset%20-%20European%20Soccer%20Database/Investigate%20a%20Dataset%20-%20European%20Soccer%20Database%20-%20Spain%20LIGA%20BBVA.ipynb

Example Questions

How many goals has each team scored home, away and total on each season?
How many goals has each team allowed home, away and total on each season?
How teams improved or deteriorated over time and what teams improved or deteriorated the most and the least in terms of goals scored or allowed over time?
How many goals has each team allowed home, away and total, from other teams, over all seasons?
Is there such a thing a home advantage?
Which players had the most penalties?
What team attributes lead to the most victories?

Import data analysis modules

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
import seaborn as sns

Connect to local sqlite database with the following tables:

Country, League, Match, Player, Player_Attributes, Team, Team_Attributes

In [2]:
connection = sqlite3.connect('database.sqlite')

Run any query, store and display results in a DataFrame

Select Country, League, Date, Team Home, Team Away and final score for every match in the dataset that belongs to the Spanish LIGA BBVA league.

In [3]:
query = '''
select 
    Country.name as Country, 
    League.name as League, 
    Match.season as Season,
    date(Match.date) as Date,
    Team_home.team_long_name as Team_Home, 
    Team_away.team_long_name as Team_Away, 
    Match.home_team_goal as Team_Home_Goal, 
    Match.away_team_goal as Team_Away_Goal
from
    Country, League, Team as Team_home, Team as Team_away, Match
where
    Country.id = League.country_id AND
    Country.id = Match.country_id AND
    League.id = Match.league_id AND
    Team_home.team_api_id = Match.home_team_api_id AND
    Team_away.team_api_id = Match.away_team_api_id AND
    Country = 'Spain' AND
    League.name = 'Spain LIGA BBVA'
order by
    Match.date
-- limit 10;
'''

#Run query and store results in a dataframe
spain_liga_bbva_df = pd.read_sql(query,connection)
In [4]:
print(spain_liga_bbva_df.columns)
Index([u'Country', u'League', u'Season', u'Date', u'Team_Home', u'Team_Away',
       u'Team_Home_Goal', u'Team_Away_Goal'],
      dtype='object')

Create helpful functions that will be used repeatedly for resetting an index of an input series, store result in a dataframe, pivot it so that each Season is a separate column and finally plot its input columns over its input index

In [5]:
"""
reset_index_pivot_stacked_bar_plot_series(Args) is used for resetting an index of an input series, store result in a dataframe, pivot it so that each Season is a separate column and finally plot its input columns over its input index.

Args:
    plot_type: Type of chart plotted, 'stacked_bar' or 'heat_map' 
    season_team_scored_allowed_sum: input pandas Series containing Season, Team and sum of goals scored or allowed for each season 
    index: 'Team_Home' or 'Team_Away' or 'Team' index used for x label plotting
    columns: Columns to pivot resulting dataframe, usually 'Season', used as x label subplotting
    values: 'Team_Home_Goal' or 'Team_Away_Goal' or 'Team_Goal' or 'Team_Allowed' columns used as values of a plot
    title: Title of a plot
    xlabel: X axis label of a plot
    ylabel: Y axis label of a plot
    
Returns:
    Pivoted DataFrame with each Season in a separate column.

Raises:
    None
"""
def reset_index_pivot_plot_series(plot_type, plot_color, season_team_scored_allowed_sum, \
    input_index, input_columns, input_values, title, xlabel, ylabel, reset_flag, pivot_flag):
    
    if reset_flag:
        #Reset index of series so 'Season' column is fully populated, store in pandas DataFrame
        season_team_scored_allowed_sum = season_team_scored_allowed_sum.reset_index()
    if pivot_flag:
        #Pivot dataframe so that each Season is a separate column, needed for multi-series plots
        season_team_scored_allowed_sum = season_team_scored_allowed_sum.pivot(index=input_index, columns=input_columns, values=input_values)
    
    if plot_type == 'stacked_bar':
        #Plot a stacked bar of final DataFrame with details provided 
        season_team_scored_allowed_sum.plot(kind='bar', stacked=True, figsize=(20, 15))
    elif plot_type == 'heat_map':
        #Plot a heat map of final DataFrame with details provided 
        plt.subplots(figsize=(20,15))
        sns.heatmap(season_team_scored_allowed_sum, annot=True, cmap=plot_color)
    else:
        #Plot a stacked bar of final DataFrame with details provided 
        season_team_scored_allowed_sum.plot(kind='bar', stacked=True, figsize=(20, 15))
        
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()
    
    #Return final DataFrame
    return season_team_scored_allowed_sum

Compute home and away goals scored per season and add them up, store in pandas Series

In [6]:
season_team_home_goal_sum = spain_liga_bbva_df.groupby(['Season', 'Team_Home'])['Team_Home_Goal'].sum()
season_team_away_goal_sum = spain_liga_bbva_df.groupby(['Season', 'Team_Away'])['Team_Away_Goal'].sum()
season_team_goal_sum = season_team_home_goal_sum + season_team_away_goal_sum
print type(season_team_home_goal_sum)
print season_team_home_goal_sum.head(30)
<class 'pandas.core.series.Series'>
Season     Team_Home                
2008/2009  Athletic Club de Bilbao      28
           Atlético Madrid              47
           CA Osasuna                   27
           CD Numancia                  23
           FC Barcelona                 61
           Getafe CF                    27
           Málaga CF                    28
           RC Deportivo de La Coruña    30
           RC Recreativo                17
           RCD Espanyol                 28
           RCD Mallorca                 33
           Racing Santander             28
           Real Betis Balompié          24
           Real Madrid CF               49
           Real Sporting de Gijón       24
           Real Valladolid              22
           Sevilla FC                   28
           UD Almería                   27
           Valencia CF                  47
           Villarreal CF                33
2009/2010  Athletic Club de Bilbao      34
           Atlético Madrid              37
           CA Osasuna                   21
           CD Tenerife                  26
           FC Barcelona                 57
           Getafe CF                    34
           Málaga CF                    21
           RC Deportivo de La Coruña    22
           RCD Espanyol                 20
           RCD Mallorca                 42
Name: Team_Home_Goal, dtype: int64

Interest 1a: We are interested in computing and plotting the goals scored home on each season for each team. This way we can see if a team is performing very well in terms of goals scored home. We would also like to see if any team has performed better or worse over time. A simple bar chart with each team on the x-axis and goals scored home on the y-axis, aggregated over each season should help us get an idea.

Question 1a: How many goals has each team scored home on each season?

Answer 1a: Home goals scored per Season

In [7]:
season_team_home_goal_sum = reset_index_pivot_plot_series('stacked_bar', '', season_team_home_goal_sum, \
    'Team_Home', 'Season', 'Team_Home_Goal', 'Home goals scored per Season', 'Team', 'Home Goals Scored', True, True)
season_team_home_goal_sum.head()
Out[7]:
Season 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
Team_Home
Athletic Club de Bilbao 28.0 34.0 32.0 29.0 22.0 42.0 28.0 35.0
Atlético Madrid 47.0 37.0 35.0 36.0 42.0 49.0 42.0 33.0
CA Osasuna 27.0 21.0 28.0 26.0 15.0 20.0 NaN NaN
CD Numancia 23.0 NaN NaN NaN NaN NaN NaN NaN
CD Tenerife NaN 26.0 NaN NaN NaN NaN NaN NaN

Conclusion 1a: This graph shows two best performing team in terms of goals scored home, FC Barcelona and Real Madrid. Looking at the chart though is not that easy to notice any improvement or deterioration over time for each team. This is a limitation of this specific chart that we will explore next.

If we display the same information, but this time using a heat map, we can clearly see how each team has performed over time. For example we can see that both FC Barcelona and Real Madrid peaked their performance in the 2011/2012 season and are getting ready for a comeback in the 2015/2016 season.

In [8]:
season_team_home_goal_sum = reset_index_pivot_plot_series('heat_map', 'YlGnBu', season_team_home_goal_sum, \
    'Team_Home', 'Season', 'Team_Home_Goal', 'Home goals scored per Season', 'Season', 'Team', False, False)

Interest 1b: Next, we are interested in computing and plotting which team performed better or worse over time in terms of goals scored home. A heatmap chart with each team on the y-axis and goals scored home differentials (goal difference from season to season) as values, with each season on the x-axis, should give us an idea.

Question 1b: How teams improved over time and what teams improved the most and the least in terms of goals scored home over time?

Answer 1b: Home goals scored over Seasons

In [9]:
season_team_home_goal_sum_diff = season_team_home_goal_sum.diff(periods=1, axis='columns')

season_team_home_goal_sum_diff = reset_index_pivot_plot_series('heat_map', 'cool', season_team_home_goal_sum_diff, \
    'Team_Home', 'Season', 'Team_Home_Goal', 'Home goals scored differentials over Seasons', 'Season', 'Team', False, False)

Conclusion 1b: This graph shows two best performing team in terms of goals scored home differentials, FC Barcelona and Athletic Club de Bilbao. Barcelona had a stellar improvement in the 2011/2012 season compared to its previous season. Teams like Mallorca and Valencia on the other hand had a bad season in terms of our metric in 2010/2011 and 2015/2016 seasons correspondingly.

Interest 2a: Similarly to goals scored home, we are also interested in computing and plotting the goals scored away on each season for each team. This way we can see if a team is performing very well in terms of goals scored away. We would also like to see if any team has performed better or worse over time. A simple bar chart with each team on the x-axis and goals scored away on the y-axis, aggregated over each season should help us get an idea.

Question 2a: How many goals has each team scored away on each season?

Answer 2a: Away goals scored per Season

In [10]:
season_team_away_goal_sum = reset_index_pivot_plot_series('stacked_bar', '', season_team_away_goal_sum, \
    'Team_Away', 'Season', 'Team_Away_Goal', 'Away goals scored per Season', 'Team', 'Away Goals Scored', True, True)
season_team_away_goal_sum.head()
Out[10]:
Season 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
Team_Away
Athletic Club de Bilbao 19.0 16.0 27.0 20.0 22.0 24.0 14.0 23.0
Atlético Madrid 33.0 20.0 27.0 17.0 23.0 28.0 25.0 30.0
CA Osasuna 14.0 16.0 17.0 18.0 18.0 12.0 NaN NaN
CD Numancia 15.0 NaN NaN NaN NaN NaN NaN NaN
CD Tenerife NaN 14.0 NaN NaN NaN NaN NaN NaN

Conclusion 2a: This graph shows two best performing team in terms of goals scored home, FC Barcelona and Real Madrid. Looking at the chart though is not that easy to notice any improvement or deterioration over time for each team. This is a limitation of this specific chart that we will explore next.

If we display the same information, but this time using a heat map, we can clearly see how each team has performed over time. For example we can see that both FC Barcelona and Real Madrid peaked their performance in the 2012/2013 and 2014/2015 season correspondingly.

In [11]:
season_team_away_goal_sum = reset_index_pivot_plot_series('heat_map', 'YlGnBu', season_team_away_goal_sum, \
    'Team_Away', 'Season', 'Team_Away_Goal', 'Away goals scored per Season', 'Season', 'Team', False, False)

Interest 2b: Next, we are interested in computing and plotting which team performed better or worse over time in terms of goals scored away. A heatmap chart with each team on the y-axis and goals scored away differentials (goal difference from season to season) as values, with each season on the x-axis, should give us an idea.

Question 2b: How teams improved over time and what teams improved the most and the least in terms of goals scored away over time?

Answer 2b: Away goals scored over Seasons

In [12]:
season_team_away_goal_sum_diff = season_team_away_goal_sum.diff(periods=1, axis='columns')

season_team_away_goal_sum_diff = reset_index_pivot_plot_series('heat_map', 'cool', season_team_away_goal_sum_diff, \
    'Team_Away', 'Season', 'Team_Away_Goal', 'Away goals scored differentials over Seasons', 'Season', 'Team', False, False)

Conclusion 2b: This graph shows several best performing teams in terms of goals scored away differentials, Bilbao, Barcelona, Celta de Vigo, Real Madrid, Real Sociedad, Sevilla and Valencia had significant improvement over several seasons compared to their previous season. Teams like Sevilla had a bad season in terms of our metric in the 2015/2016 season.

Interest 3a: We are now interested in computing and plotting the total goals scored (home or away) on each season for each team. This way we can see if a team is performing very well in terms of total goals scored. We would also like to see if any team has performed better or worse over time. A simple bar chart with each team on the x-axis and total goals scored on the y-axis, aggregated over each season should help us get an idea.

Question 3a: How many goals has each team scored on each season?

Answer 3a: Total goals scored per Season

In [13]:
season_team_goal_sum = season_team_goal_sum.reset_index()
season_team_goal_sum.rename(columns={'Season': 'Season', 'Team_Home': 'Team', 0: 'Team_Goal'}, inplace=True)

season_team_goal_sum = reset_index_pivot_plot_series('stacked_bar', '', season_team_goal_sum, \
    'Team', 'Season', 'Team_Goal', 'Total goals scored per Season', 'Team', 'Total Goals Scored', False, True)
season_team_goal_sum.head()
Out[13]:
Season 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
Team
Athletic Club de Bilbao 47.0 50.0 59.0 49.0 44.0 66.0 42.0 58.0
Atlético Madrid 80.0 57.0 62.0 53.0 65.0 77.0 67.0 63.0
CA Osasuna 41.0 37.0 45.0 44.0 33.0 32.0 NaN NaN
CD Numancia 38.0 NaN NaN NaN NaN NaN NaN NaN
CD Tenerife NaN 40.0 NaN NaN NaN NaN NaN NaN

Conclusion 3a: This graph shows two best performing team in terms of goals scored home, FC Barcelona and Real Madrid. Looking at the chart though is not that easy to notice any improvement or deterioration over time for each team. This is a limitation of this specific chart that we will explore next.

If we display the same information, but this time using a heat map, we can clearly see how each team has performed over time. For example we can see that both FC Barcelona and Real Madrid peaked their performance in the 2011/2012 season as well as in the 2014/2015 and 2015/2016 season.

In [14]:
season_team_goal_sum = reset_index_pivot_plot_series('heat_map', 'YlGnBu', season_team_goal_sum, \
    'Team', 'Season', 'Team_Goal', 'Total goals scored per Season', 'Season', 'Team', False, False)

Interest 3b: Next, we are interested in computing and plotting which team performed better or worse over time in terms of total goals scored. A heatmap chart with each team on the y-axis and total goals scored differentials (goal difference from season to season) as values, with each season on the x-axis, should give us an idea.

Question 3b: How teams improved over time and what teams improved the most and the least in terms of total goals scored over time?

Answer 3b: Total goals scored over Seasons

In [15]:
season_team_goal_sum_diff = season_team_goal_sum.diff(periods=1, axis='columns')

season_team_goal_sum_diff = reset_index_pivot_plot_series('heat_map', 'cool', season_team_goal_sum_diff, \
    'Team', 'Season', 'Team_Goal', 'Total goals scored differentials over Seasons', 'Season', 'Team', False, False)

Conclusion 3b: This graph shows several best performing teams in terms of total goals scored differentials, Bilbao, Barcelona, Real Madrid, Real Sociedad and Valencia. Real Sociedad had a stellar improvement in the 2012/2013 season compared to its previous season. Teams like Bilbao and Valencia had a bad season in terms of our metric in the 2014/2015 and 2015/2016 season correspondingly.

Let us switch gears a little bit now and look at similar statistics but this time for a new metric, goals allowed at home, away and overall.

Compute home and away goals allowed per season and add them up, store in pandas Series

In [16]:
season_team_home_allowed_sum = spain_liga_bbva_df.groupby(['Season', 'Team_Home'])['Team_Away_Goal'].sum()
season_team_away_allowed_sum = spain_liga_bbva_df.groupby(['Season', 'Team_Away'])['Team_Home_Goal'].sum()
season_team_allowed_sum = season_team_home_allowed_sum + season_team_away_allowed_sum
print type(season_team_home_allowed_sum)
print season_team_home_allowed_sum.head(30)
<class 'pandas.core.series.Series'>
Season     Team_Home                
2008/2009  Athletic Club de Bilbao      29
           Atlético Madrid              23
           CA Osasuna                   22
           CD Numancia                  22
           FC Barcelona                 14
           Getafe CF                    23
           Málaga CF                    23
           RC Deportivo de La Coruña    18
           RC Recreativo                29
           RCD Espanyol                 22
           RCD Mallorca                 24
           Racing Santander             22
           Real Betis Balompié          25
           Real Madrid CF               29
           Real Sporting de Gijón       37
           Real Valladolid              17
           Sevilla FC                   20
           UD Almería                   20
           Valencia CF                  26
           Villarreal CF                25
2009/2010  Athletic Club de Bilbao      24
           Atlético Madrid              24
           CA Osasuna                   16
           CD Tenerife                  29
           FC Barcelona                 11
           Getafe CF                    22
           Málaga CF                    21
           RC Deportivo de La Coruña    21
           RCD Espanyol                 17
           RCD Mallorca                 15
Name: Team_Away_Goal, dtype: int64

Interest 4a: We are interested in computing and plotting the goals allowed home on each season for each team. This way we can see if a team is performing very bad in terms of goals allowed home. We would also like to see if any team has performed better or worse over time. A simple bar chart with each team on the x-axis and goals allowed home on the y-axis, aggregated over each season should help us get an idea.

Question 4a: How many goals has each team allowed home on each season?

Answer 4a: Home goals allowed per Season

In [17]:
season_team_home_allowed_sum = reset_index_pivot_plot_series('stacked_bar', '', season_team_home_allowed_sum, \
    'Team_Home', 'Season', 'Team_Away_Goal', 'Home goals allowed per Season', 'Team', 'Home Goals Allowed', True, True)
season_team_home_allowed_sum.head()
Out[17]:
Season 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
Team_Home
Athletic Club de Bilbao 29.0 24.0 20.0 21.0 27.0 18.0 20.0 17.0
Atlético Madrid 23.0 24.0 20.0 17.0 12.0 10.0 11.0 7.0
CA Osasuna 22.0 16.0 14.0 19.0 14.0 25.0 NaN NaN
CD Numancia 22.0 NaN NaN NaN NaN NaN NaN NaN
CD Tenerife NaN 29.0 NaN NaN NaN NaN NaN NaN

Conclusion 4a: This graph shows two worst performing team in terms of goals allowed home, Getafe CF and RCD Espanyol. Looking at the chart though is not that easy to notice any improvement or deterioration over time for each team. This is a limitation of this specific chart that we will explore next.

If we display the same information, but this time using a heat map, we can clearly see how each team has performed over time. For example we can see that Rayo Vallecano, Real Sporting de Gijon and Almeria had their worst performance in the 2013/2014, 2008/2009 and 2010/2011 season correspondingly.

In [18]:
season_team_home_allowed_sum = reset_index_pivot_plot_series('heat_map', 'RdPu', season_team_home_allowed_sum, \
    'Team_Home', 'Season', 'Team_Away_Goal', 'Home goals allowed per Season', 'Season', 'Team', False, False)

Interest 4b: Next, we are interested in computing and plotting which team performed better or worse over time in terms of goals allowed home. A heatmap chart with each team on the y-axis and goals allowed home differentials (goal difference from season to season) as values, with each season on the x-axis, should give us an idea.

Question 4b: How teams deteriorated over time and what teams deteriorated the most and the least in terms of goals allowed home over time?

Answer 4b: Home goals allowed over Seasons

In [19]:
season_team_home_allowed_sum_diff = season_team_home_allowed_sum.diff(periods=1, axis='columns')

season_team_home_allowed_sum_diff = reset_index_pivot_plot_series('heat_map', 'magma_r', season_team_home_allowed_sum_diff, \
    'Team_Home', 'Season', 'Team_Away_Goal', 'Home goals allowed differentials over Seasons', 'Season', 'Team', False, False)

Conclusion 4b: This graph shows three worst performing teams in terms of goals allowed home differentials, Elche CF in their 2014/2015 season, RCD Mallorca in their 2012/2013 season and Real Valladolid in their 2009/2010 season, compared to their previous seasons as always.

Interest 5a: Similarly to goals allowed home, we are also interested in computing and plotting the goals allowed away on each season for each team. This way we can see if a team is performing very bad in terms of goals allowed away. We would also like to see if any team has performed better or worse over time. A simple bar chart with each team on the x-axis and goals allowed away on the y-axis, aggregated over each season should help us get an idea.

Question 5a: How many goals has each team allowed away on each season?

Answer 5a: Away goals allowed per Season

In [20]:
season_team_away_allowed_sum = reset_index_pivot_plot_series('stacked_bar', '', season_team_away_allowed_sum, \
    'Team_Away', 'Season', 'Team_Home_Goal', 'Away goals allowed per Season', 'Team', 'Away Goals Allowed', True, True)
season_team_away_allowed_sum.head()
Out[20]:
Season 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
Team_Away
Athletic Club de Bilbao 33.0 29.0 35.0 31.0 38.0 21.0 21.0 28.0
Atlético Madrid 34.0 37.0 33.0 29.0 19.0 16.0 18.0 11.0
CA Osasuna 25.0 30.0 32.0 42.0 36.0 37.0 NaN NaN
CD Numancia 47.0 NaN NaN NaN NaN NaN NaN NaN
CD Tenerife NaN 45.0 NaN NaN NaN NaN NaN NaN

Conclusion 5a: This graph shows two worst performing teams in terms of goals allowed away, Getafe CF and RCD Espanyol. Looking at the chart though is not that easy to notice any improvement or deterioration over time for each team. This is a limitation of this specific chart that we will explore next.

If we display the same information, but this time using a heat map, we can clearly see how each team has performed over time. For example we can see that teams like CD Numancia, Getafe, Grenada, Rayo, Betis Balompie had some underperforming seasons in 2008, 2015, 2014 2011, 2013 correspondingly.

In [21]:
season_team_away_allowed_sum = reset_index_pivot_plot_series('heat_map', 'RdPu', season_team_away_allowed_sum, \
    'Team_Away', 'Season', 'Team_Home_Goal', 'Away goals allowed per Season', 'Season', 'Team', False, False)

Interest 5b: Next, we are interested in computing and plotting which team performed better or worse over time in terms of goals allowed away. A heatmap chart with each team on the y-axis and goals allowed away differentials (goal difference from season to season) as values, with each season on the x-axis, should give us an idea.

Question 5b: How teams deteriorated over time and what teams deteriorated the most and the least in terms of goals allowed away over time?

Answer 5b: Away goals allowed over Seasons

In [22]:
season_team_away_allowed_sum_diff = season_team_away_allowed_sum.diff(periods=1, axis='columns')

season_team_away_allowed_sum_diff = reset_index_pivot_plot_series('heat_map', 'magma_r', season_team_away_allowed_sum_diff, \
    'Team_Away', 'Season', 'Team_Home_Goal', 'Away goals allowed differentials over Seasons', 'Season', 'Team', False, False)

Conclusion 5b: This graph shows four worst performing teams in terms of goals allowed away differentials, Grenada in their 2014/2015 season, RCD Espanyol in their 2015/2016 season, Gijon in their 2011/2012 season and Eibar in their 2015/2016 season, compared to their previous seasons as always.

Interest 6a: We are now interested in computing and plotting the total goals allowed, home or away, on each season for each team. This way we can see if a team is performing very bad in terms of total goals allowed. We would also like to see if any team has performed better or worse over time. A simple bar chart with each team on the x-axis and total goals allowed on the y-axis, aggregated over each season should help us get an idea

Question 6a: How many goals has each team allowed on each season?

Answer 6a: Total goals allowed per Season

In [23]:
season_team_allowed_sum = season_team_allowed_sum.reset_index()
season_team_allowed_sum.rename(columns={'Season': 'Season', 'Team_Home': 'Team', 0: 'Team_Allowed'}, inplace=True)

season_team_allowed_sum = reset_index_pivot_plot_series('stacked_bar', '', season_team_allowed_sum, \
    'Team', 'Season', 'Team_Allowed', 'Total goals allowed per Season', 'Team', 'Total Goals Allowed', False, True)
season_team_allowed_sum.head()
Out[23]:
Season 2008/2009 2009/2010 2010/2011 2011/2012 2012/2013 2013/2014 2014/2015 2015/2016
Team
Athletic Club de Bilbao 62.0 53.0 55.0 52.0 65.0 39.0 41.0 45.0
Atlético Madrid 57.0 61.0 53.0 46.0 31.0 26.0 29.0 18.0
CA Osasuna 47.0 46.0 46.0 61.0 50.0 62.0 NaN NaN
CD Numancia 69.0 NaN NaN NaN NaN NaN NaN NaN
CD Tenerife NaN 74.0 NaN NaN NaN NaN NaN NaN

Conclusion 6a: This graph shows two worst performing team in terms of total goals allowed, Getafe CF and RCD Espanyol. Looking at the chart though is not that easy to notice any improvement or deterioration over time for each team. This is a limitation of this specific chart that we will explore next.

If we display the same information, but this time using a heat map, we can clearly see how each team has performed over time. For example we can see that Rayo Vallecano, Real Betis and Gijon had their worst performance in the 2013/2014, 2013/2014 and 2008/2009 season correspondingly.

In [24]:
season_team_allowed_sum = reset_index_pivot_plot_series('heat_map', 'RdPu', season_team_allowed_sum, \
    'Team', 'Season', 'Team_Allowed', 'Total goals allowed per Season', 'Season', 'Team', False, False)

Interest 6b: Next, we are interested in computing and plotting which team performed better or worse over time in terms of total goals allowed. A heatmap chart with each team on the y-axis and total goals allowed differentials (goal difference from season to season) as values, with each season on the x-axis, should give us an idea.

Question 6b: How teams deteriorated over time and what teams deteriorated the most and the least in terms of goals allowed over time?

Answer 6b: Total goals allowed over Seasons

In [25]:
season_team_allowed_sum_diff = season_team_allowed_sum.diff(periods=1, axis='columns')

season_team_allowed_sum = reset_index_pivot_plot_series('heat_map', 'magma_r', season_team_allowed_sum_diff, \
    'Team', 'Season', 'Team_Allowed', 'Total goals allowed differentials over Seasons', 'Season', 'Team', False, False)

Conclusion 6b: This graph shows three worst performing teams in terms of goals allowed home differentials, Levante in their 2014/2015 season, Mallorca in their 2012/2013 season and Gijon in their 2011/2012 season, compared to their previous seasons as always.

So far we have looked at statistics of each team individually.

In this section we explore how teams perform against each other.

Team match statistics could also be used as forecasts or projections for future games against the same teams.

Compute number of unique seasons

In [26]:
unique_seasons = pd.Series(spain_liga_bbva_df['Season'].unique())

A reminder of the structure of our original League DataFrame

In [27]:
spain_liga_bbva_df.head()
Out[27]:
Country League Season Date Team_Home Team_Away Team_Home_Goal Team_Away_Goal
0 Spain Spain LIGA BBVA 2008/2009 2008-08-30 Valencia CF RCD Mallorca 3 0
1 Spain Spain LIGA BBVA 2008/2009 2008-08-30 RCD Espanyol Real Valladolid 1 0
2 Spain Spain LIGA BBVA 2008/2009 2008-08-31 CA Osasuna Villarreal CF 1 1
3 Spain Spain LIGA BBVA 2008/2009 2008-08-31 RC Deportivo de La Coruña Real Madrid CF 2 1
4 Spain Spain LIGA BBVA 2008/2009 2008-08-31 CD Numancia FC Barcelona 1 0

Group original spain_liga_bbva_df DataFrame by Team_Home and Team_Away goals scored and allowed

In [28]:
team_match_goal_sum = spain_liga_bbva_df.groupby(['Team_Home', 'Team_Away'])['Team_Home_Goal', 'Team_Away_Goal'].sum()
team_match_goal_sum.head()
Out[28]:
Team_Home_Goal Team_Away_Goal
Team_Home Team_Away
Athletic Club de Bilbao Atlético Madrid 11 13
CA Osasuna 11 1
CD Numancia 2 0
CD Tenerife 4 1
Córdoba CF 0 1

Interest 7a: We are interested in computing and plotting goals scored home for each team against each other team. This way we can see if a team is performing very well in terms of goals scored home, over all seasons. A heatmap, with each team playing home on the y-axis and each team playing away on the x-axis and goals scored home as values, should help us get an idea.

Question 7a: How many goals has each team scored home, against other teams, over all seasons?

Answer 7a: Home goals scored against other teams

In [29]:
team_match_home_goal_sum = reset_index_pivot_plot_series('heat_map', 'YlGnBu', team_match_goal_sum, \
    'Team_Home', 'Team_Away', 'Team_Home_Goal', \
    'Home goals scored over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team Away', 'Team Home', True, True)

team_match_home_goal_sum.head()
Out[29]:
Team_Away Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Home
Athletic Club de Bilbao NaN 11.0 11.0 2.0 4.0 0.0 3.0 9.0 14.0 6.0 ... 11.0 10.0 4.0 5.0 13.0 14.0 2.0 11.0 11.0 3.0
Atlético Madrid 14.0 NaN 11.0 3.0 3.0 4.0 5.0 10.0 19.0 13.0 ... 15.0 9.0 8.0 5.0 13.0 13.0 1.0 13.0 11.0 1.0
CA Osasuna 6.0 8.0 NaN 2.0 1.0 NaN 2.0 7.0 8.0 4.0 ... 5.0 4.0 6.0 NaN 6.0 4.0 NaN 5.0 5.0 1.0
CD Numancia 1.0 1.0 0.0 NaN NaN NaN NaN 1.0 2.0 NaN ... 2.0 4.0 NaN NaN 0.0 2.0 NaN 2.0 1.0 NaN
CD Tenerife 1.0 1.0 2.0 NaN NaN NaN NaN 0.0 3.0 NaN ... 2.0 0.0 1.0 NaN 1.0 2.0 NaN 0.0 2.0 1.0

5 rows × 33 columns

Conclusion 7a: This graph shows two best performing team in terms of goals scored home, FC Barcelona and Real Madrid. It also shows few of the worst performing teams like Numancia, Cordoba and Deportivo. Interestingly, Bilbao and Getafe seem to not perform that well against the two top teams when they play at their home fields.

Interest 7b: We are interested in computing and plotting goals scored away for each team against each other team. This way we can see if a team is performing very well in terms of goals scored away, over all seasons. A heatmap, with each team playing away on the y-axis and each team playing home on the x-axis and goals scored away as values, should help us get an idea.

Question 7b: How many goals has each team scored away, against other teams, over all seasons?

Answer 7b: Away goals scored against other teams

In [30]:
team_match_away_goal_sum = reset_index_pivot_plot_series('heat_map', 'YlGnBu', team_match_goal_sum, \
    'Team_Away', 'Team_Home', 'Team_Away_Goal', \
    'Away goals scored over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team Home', 'Team Away', True, True)

team_match_away_goal_sum.head()
Out[30]:
Team_Home Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Away
Athletic Club de Bilbao NaN 7.0 10.0 2.0 0.0 1.0 3.0 4.0 7.0 4.0 ... 6.0 7.0 5.0 1.0 7.0 9.0 0.0 8.0 6.0 1.0
Atlético Madrid 13.0 NaN 6.0 1.0 1.0 2.0 4.0 7.0 9.0 5.0 ... 8.0 10.0 5.0 5.0 10.0 4.0 3.0 9.0 8.0 2.0
CA Osasuna 1.0 6.0 NaN 0.0 1.0 NaN 0.0 2.0 5.0 1.0 ... 4.0 6.0 6.0 NaN 2.0 5.0 NaN 3.0 7.0 2.0
CD Numancia 0.0 0.0 0.0 NaN NaN NaN NaN 1.0 0.0 NaN ... 1.0 0.0 NaN NaN 0.0 1.0 NaN 0.0 1.0 NaN
CD Tenerife 1.0 1.0 0.0 NaN NaN NaN NaN 1.0 1.0 NaN ... 2.0 3.0 0.0 NaN 0.0 1.0 NaN 0.0 0.0 1.0

5 rows × 33 columns

Conclusion 7b: This graph shows two best performing team in terms of goals scored home, FC Barcelona and Real Madrid. It also shows few of the worst performing teams like Numancia, Cordoba and Deportivo. Interestingly, Getafe and La Coruna seem to not perform that well against the two top teams when they play away from their home fields.

Interest 7c: We are interested in computing and plotting total goals scored, home or away, for each team against each other team. This way we can see if a team is performing very well in terms of total goals scored, over all seasons. A heatmap, with each team on the y-axis and each team on the x-axis and goals scored from each team on the y-axis over each team on the x-axis as values, should help us get an idea.

Question 7c: How many goals has each team scored, against other teams, over all seasons

Answer 7c: Total goals scored against other teams

In [31]:
team_match_total_goal_sum = team_match_home_goal_sum + team_match_away_goal_sum

team_match_total_goal_sum = reset_index_pivot_plot_series('heat_map', 'YlGnBu', team_match_total_goal_sum, \
    '', '', '', \
    'Total goals scored over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team', 'Team', False, False)

team_match_total_goal_sum.head()
Out[31]:
Team_Away Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Home
Athletic Club de Bilbao NaN 18.0 21.0 4.0 4.0 1.0 6.0 13.0 21.0 10.0 ... 17.0 17.0 9.0 6.0 20.0 23.0 2.0 19.0 17.0 4.0
Atlético Madrid 27.0 NaN 17.0 4.0 4.0 6.0 9.0 17.0 28.0 18.0 ... 23.0 19.0 13.0 10.0 23.0 17.0 4.0 22.0 19.0 3.0
CA Osasuna 7.0 14.0 NaN 2.0 2.0 NaN 2.0 9.0 13.0 5.0 ... 9.0 10.0 12.0 NaN 8.0 9.0 NaN 8.0 12.0 3.0
CD Numancia 1.0 1.0 0.0 NaN NaN NaN NaN 2.0 2.0 NaN ... 3.0 4.0 NaN NaN 0.0 3.0 NaN 2.0 2.0 NaN
CD Tenerife 2.0 2.0 2.0 NaN NaN NaN NaN 1.0 4.0 NaN ... 4.0 3.0 1.0 NaN 1.0 3.0 NaN 0.0 2.0 2.0

5 rows × 33 columns

Conclusion 7c: This graph shows two best performing team in terms of total goals scored, home or away, FC Barcelona and Real Madrid. It also shows few of the worst performing teams like Numancia, Cordoba and Deportivo. Interestingly, Bilbao and Getafe seem to not perform that well against the two top teams, regardless if they play at their home fields or away.

Interest 8a: We are interested in computing and plotting goals allowed home for each team against each other team. This way we can see if a team is performing very bad in terms of goals allowed home, over all seasons. A heatmap, with each team playing home on the y-axis and each team playing away on the x-axis and goals allowed home as values, should help us get an idea.

Question 8a: How many goals has each team allowed home, from other teams, over all seasons?

Answer 8a: Home goals allowed from other teams

In [32]:
team_match_home_allowed_sum = reset_index_pivot_plot_series('heat_map', 'RdPu', team_match_goal_sum, \
    'Team_Home', 'Team_Away', 'Team_Away_Goal', \
    'Home goals allowed over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team Away', 'Team Home', True, True)

team_match_home_allowed_sum.head()
Out[32]:
Team_Away Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Home
Athletic Club de Bilbao NaN 13.0 1.0 0.0 1.0 1.0 4.0 15.0 6.0 3.0 ... 3.0 2.0 4.0 2.0 9.0 6.0 2.0 12.0 8.0 2.0
Atlético Madrid 7.0 NaN 6.0 0.0 1.0 2.0 0.0 13.0 4.0 0.0 ... 3.0 4.0 2.0 2.0 5.0 5.0 0.0 6.0 6.0 2.0
CA Osasuna 10.0 6.0 NaN 0.0 0.0 NaN 1.0 11.0 2.0 5.0 ... 3.0 5.0 0.0 NaN 7.0 2.0 NaN 6.0 6.0 1.0
CD Numancia 2.0 1.0 0.0 NaN NaN NaN NaN 0.0 0.0 NaN ... 1.0 3.0 NaN NaN 2.0 1.0 NaN 1.0 2.0 NaN
CD Tenerife 0.0 1.0 1.0 NaN NaN NaN NaN 5.0 2.0 NaN ... 1.0 0.0 3.0 NaN 2.0 2.0 NaN 0.0 2.0 0.0

5 rows × 33 columns

Conclusion 8a: This graph shows teams like Getafe, Levante and La Coruna seem to not perform that well against other teams and especially against the two top teams, Barcelona and Real Madrid.

Interest 8b: We are interested in computing and plotting goals allowed away for each team against each other team. This way we can see if a team is performing very bad in terms of goals allowed away, over all seasons. A heatmap, with each team playing away on the y-axis and each team playing home on the x-axis and goals allowed away as values, should help us get an idea.

Question 8b: How many goals has each team allowed away, from other teams, over all seasons?

Answer 8b: Away goals allowed from other teams

In [33]:
team_match_away_allowed_sum = reset_index_pivot_plot_series('heat_map', 'RdPu', team_match_goal_sum, \
    'Team_Away', 'Team_Home', 'Team_Home_Goal', \
    'Away goals allowed over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team Home', 'Team Away', True, True)

team_match_away_allowed_sum.head()
Out[33]:
Team_Home Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Away
Athletic Club de Bilbao NaN 14.0 6.0 1.0 1.0 0.0 2.0 25.0 7.0 7.0 ... 4.0 7.0 6.0 2.0 16.0 4.0 0.0 11.0 16.0 0.0
Atlético Madrid 11.0 NaN 8.0 1.0 1.0 0.0 0.0 29.0 6.0 1.0 ... 7.0 2.0 3.0 1.0 9.0 6.0 0.0 13.0 10.0 0.0
CA Osasuna 11.0 11.0 NaN 0.0 2.0 NaN 0.0 24.0 12.0 4.0 ... 7.0 2.0 5.0 NaN 8.0 8.0 NaN 18.0 9.0 1.0
CD Numancia 2.0 3.0 2.0 NaN NaN NaN NaN 4.0 1.0 NaN ... 3.0 0.0 NaN NaN 1.0 2.0 NaN 4.0 2.0 NaN
CD Tenerife 4.0 3.0 1.0 NaN NaN NaN NaN 4.0 2.0 NaN ... 0.0 3.0 1.0 NaN 3.0 1.0 NaN 1.0 5.0 2.0

5 rows × 33 columns

Conclusion 8b: This graph shows teams like Bilbao, Atletico Madrid and Getafe seem to not perform that well against other teams and especially against the two top teams, Barcelona and Real Madrid.

Interest 8c: We are interested in computing and plotting total goals allowed, home or away, for each team against each other team. This way we can see if a team is performing very bad in terms of total goals allowed, over all seasons. A heatmap, with each team on the y-axis and each team on the x-axis and goals allowed from each team on the y-axis from each team on the x-axis as values, should help us get an idea.

Question 8c: How many goals has each team allowed, from other teams, over all seasons

Answer 8c: Total goals allowed from other teams

In [34]:
team_match_total_allowed_sum = team_match_home_allowed_sum + team_match_away_allowed_sum

team_match_total_allowed_sum = reset_index_pivot_plot_series('heat_map', 'RdPu', team_match_total_allowed_sum, \
    '', '', '', \
    'Total goals allowed over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team', 'Team', False, False)

team_match_total_allowed_sum.head()
Out[34]:
Team_Away Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Home
Athletic Club de Bilbao NaN 27.0 7.0 1.0 2.0 1.0 6.0 40.0 13.0 10.0 ... 7.0 9.0 10.0 4.0 25.0 10.0 2.0 23.0 24.0 2.0
Atlético Madrid 18.0 NaN 14.0 1.0 2.0 2.0 0.0 42.0 10.0 1.0 ... 10.0 6.0 5.0 3.0 14.0 11.0 0.0 19.0 16.0 2.0
CA Osasuna 21.0 17.0 NaN 0.0 2.0 NaN 1.0 35.0 14.0 9.0 ... 10.0 7.0 5.0 NaN 15.0 10.0 NaN 24.0 15.0 2.0
CD Numancia 4.0 4.0 2.0 NaN NaN NaN NaN 4.0 1.0 NaN ... 4.0 3.0 NaN NaN 3.0 3.0 NaN 5.0 4.0 NaN
CD Tenerife 4.0 4.0 2.0 NaN NaN NaN NaN 9.0 4.0 NaN ... 1.0 3.0 4.0 NaN 5.0 3.0 NaN 1.0 7.0 2.0

5 rows × 33 columns

Conclusion 8c: This graph shows teams like Bilbao, Atletico Madrid and Getafe seem to not perform that well against other teams and especially against the two top teams, Barcelona and Real Madrid.

Apart from individual goals scored or allowed, let us consider a match outcome in terms of win (more goals scored), tie (no goals or equal goals scored) or loss (rival team scored more goals) for each team.

In [35]:
"""
categorize_win_tie_loss(team_match_df) is used for categorizing the outcome of match as a win, tie or loss for the home team.

Args:
    team_match_df: Team match DataFrame with columns 'Team_Home_Goal' and 'Team_Away_Goal' available.
Returns:
    Updated DataFrame with 3 new columns created in case of a match home win, tie or loss.

Raises:
    None
"""
def categorize_win_tie_loss(team_match_df):
    #Reference
    #https://pandas.pydata.org/pandas-docs/stable/indexing.html#setting-with-enlargement
    team_match_df.loc[:, 'Team_Home_Win'] = 0
    team_match_df.loc[:, 'Team_Home_Tie'] = 0
    team_match_df.loc[:, 'Team_Away_Win'] = 0
    
    team_match_df.loc[team_match_df['Team_Home_Goal'] > team_match_df['Team_Away_Goal'], 'Team_Home_Win'] = 1
    team_match_df.loc[team_match_df['Team_Home_Goal'] == team_match_df['Team_Away_Goal'], 'Team_Home_Tie'] = 1
    team_match_df.loc[team_match_df['Team_Home_Goal'] < team_match_df['Team_Away_Goal'], 'Team_Away_Win'] = 1
    return team_match_df

Enhance our original League Dataframe with win, tie or loss outcome for each match.

In [36]:
spain_liga_bbva_df = categorize_win_tie_loss(spain_liga_bbva_df)
spain_liga_bbva_df.head()
Out[36]:
Country League Season Date Team_Home Team_Away Team_Home_Goal Team_Away_Goal Team_Home_Win Team_Home_Tie Team_Away_Win
0 Spain Spain LIGA BBVA 2008/2009 2008-08-30 Valencia CF RCD Mallorca 3 0 1 0 0
1 Spain Spain LIGA BBVA 2008/2009 2008-08-30 RCD Espanyol Real Valladolid 1 0 1 0 0
2 Spain Spain LIGA BBVA 2008/2009 2008-08-31 CA Osasuna Villarreal CF 1 1 0 1 0
3 Spain Spain LIGA BBVA 2008/2009 2008-08-31 RC Deportivo de La Coruña Real Madrid CF 2 1 1 0 0
4 Spain Spain LIGA BBVA 2008/2009 2008-08-31 CD Numancia FC Barcelona 1 0 1 0 0

Group original League DataFrame by Team_Home and Team_Away win, tie or loss after each match

In [37]:
team_match_outcome_sum = spain_liga_bbva_df.groupby(['Team_Home', 'Team_Away']) \
    ['Team_Home_Win', 'Team_Home_Tie', 'Team_Away_Win'].sum()
team_match_outcome_sum.head()
Out[37]:
Team_Home_Win Team_Home_Tie Team_Away_Win
Team_Home Team_Away
Athletic Club de Bilbao Atlético Madrid 3 0 5
CA Osasuna 6 0 0
CD Numancia 1 0 0
CD Tenerife 1 0 0
Córdoba CF 0 0 1

Let's set a new index for our team_match_outcome_sum Dataframe as the pair of home and away teams of a match

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html

In [38]:
team_match_outcome_sum = team_match_outcome_sum.reset_index()
team_match_outcome_sum = team_match_outcome_sum.set_index(['Team_Home', 'Team_Away'])

Create a function that for each pair of teams (team1, team2) will add home wins for team1 with away wins for team1 for each pair of teams (team2, team1). Similarly for ties.

In [39]:
"""
add_home_and_away_wins_ties(team_match_df) is used to keep track of wins earned home or away for each pair of teams.
Similarly for ties, home or away.

Args:
    team_match_df: Team match DataFrame with columns ['Team_Home', 'Team_Away', 'Team_Home_Win', 'Team_Home_Tie', 'Team_Away_Win'] available.
Returns:
    Updated DataFrame with 2 new columns created, one for total wins, one for total ties.

Raises:
    None
"""
def add_home_and_away_wins_ties(team_match_df):
    team_match_df.loc[:, 'Team_Win'] = 0
    team_match_df.loc[:, 'Team_Tie'] = 0

    for pair in team_match_outcome_sum.index:
        team_match_df.loc[(pair[0], pair[1]), :]['Team_Win'] = team_match_df.loc[(pair[0], pair[1]), :]['Team_Home_Win'] + team_match_df.loc[(pair[1], pair[0]), :]['Team_Away_Win']
        team_match_df.loc[(pair[0], pair[1]), :]['Team_Tie'] = team_match_df.loc[(pair[0], pair[1]), :]['Team_Home_Tie'] + team_match_df.loc[(pair[1], pair[0]), :]['Team_Home_Tie']

    return team_match_df
In [40]:
team_match_outcome_sum = add_home_and_away_wins_ties(team_match_outcome_sum)
In [41]:
team_match_outcome_sum.head()
Out[41]:
Team_Home_Win Team_Home_Tie Team_Away_Win Team_Win Team_Tie
Team_Home Team_Away
Athletic Club de Bilbao Atlético Madrid 3 0 5 5 1
CA Osasuna 6 0 0 9 1
CD Numancia 1 0 0 2 0
CD Tenerife 1 0 0 1 0
Córdoba CF 0 0 1 1 0

Interest 9a: We are interested in computing and plotting number of wins for each team against each other team. This way we can see if a team is performing very well in terms of total wins, over all seasons. A heatmap, with each team on the y-axis and each team on the x-axis and total wins from each team on the y-axis over each team on the x-axis as values, should help us get an idea.

Question 9a: How many times did each team win, against other teams, over all seasons?

Answer 9a: Wins against other teams

In [42]:
team_match_win_sum = reset_index_pivot_plot_series('heat_map', 'autumn_r', team_match_outcome_sum, \
    'Team_Home', 'Team_Away', 'Team_Win', \
    'Wins over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team', 'Team', True, True)

team_match_win_sum.head()
Out[42]:
Team_Away Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Home
Athletic Club de Bilbao NaN 5.0 9.0 2.0 1.0 1.0 1.0 1.0 7.0 3.0 ... 4.0 5.0 4.0 2.0 7.0 7.0 0.0 4.0 3.0 2.0
Atlético Madrid 10.0 NaN 7.0 1.0 1.0 2.0 4.0 2.0 10.0 8.0 ... 6.0 6.0 6.0 4.0 6.0 4.0 2.0 6.0 6.0 1.0
CA Osasuna 2.0 3.0 NaN 1.0 1.0 NaN 1.0 2.0 3.0 1.0 ... 3.0 3.0 5.0 NaN 2.0 3.0 NaN 2.0 3.0 1.0
CD Numancia 0.0 0.0 0.0 NaN NaN NaN NaN 1.0 1.0 NaN ... 1.0 1.0 NaN NaN 0.0 1.0 NaN 1.0 0.0 NaN
CD Tenerife 1.0 0.0 1.0 NaN NaN NaN NaN 0.0 1.0 NaN ... 2.0 0.0 0.0 NaN 0.0 0.0 NaN 0.0 0.0 1.0

5 rows × 33 columns

Conclusion 9a: This graph shows three best performing teams in terms of total wins against other teams, home or away, FC Barcelona, Real Madrid and Atletico Madrid. It also shows few of the worst performing teams like Numancia, Cordoba and Deportivo.

Interest 9b: We are interested in computing and plotting number of ties for each team against each other team. This way we can see if a team is performing equally compared to other team, in terms of total ties, over all seasons. A heatmap, with each team on the y-axis and each team on the x-axis and total ties for each team on the y-axis against each team on the x-axis as values, should help us get an idea.

Question 9b: How many times did each team tie, against other teams, over all seasons?

Answer 9b: Ties against other teams

Note that this is the first time we encounter a symmetric matrix where entry[team1, team2] is the same as entry[team2, team1]

In [43]:
team_match_tie_sum = reset_index_pivot_plot_series('heat_map', 'rainbow', team_match_outcome_sum, \
    'Team_Home', 'Team_Away', 'Team_Tie', \
    'Home ties over '+str(unique_seasons.size)+' Seasons: '+unique_seasons.iloc[0]+' to '+unique_seasons.iloc[-1], \
    'Team', 'Team', True, True)

team_match_tie_sum.head()
Out[43]:
Team_Away Athletic Club de Bilbao Atlético Madrid CA Osasuna CD Numancia CD Tenerife Córdoba CF Elche CF FC Barcelona Getafe CF Granada CF ... Real Sporting de Gijón Real Valladolid Real Zaragoza SD Eibar Sevilla FC UD Almería UD Las Palmas Valencia CF Villarreal CF Xerez Club Deportivo
Team_Home
Athletic Club de Bilbao NaN 1.0 1.0 0.0 0.0 0.0 2.0 3.0 5.0 3.0 ... 5.0 2.0 1.0 1.0 2.0 1.0 2.0 5.0 4.0 0.0
Atlético Madrid 1.0 NaN 2.0 1.0 1.0 0.0 0.0 2.0 3.0 2.0 ... 2.0 0.0 1.0 0.0 6.0 4.0 0.0 5.0 3.0 0.0
CA Osasuna 1.0 2.0 NaN 1.0 0.0 NaN 1.0 2.0 5.0 2.0 ... 1.0 4.0 2.0 NaN 3.0 1.0 NaN 3.0 4.0 1.0
CD Numancia 0.0 1.0 1.0 NaN NaN NaN NaN 0.0 0.0 NaN ... 0.0 1.0 NaN NaN 0.0 0.0 NaN 0.0 0.0 NaN
CD Tenerife 0.0 1.0 0.0 NaN NaN NaN NaN 0.0 0.0 NaN ... 0.0 2.0 0.0 NaN 0.0 2.0 NaN 1.0 1.0 0.0

5 rows × 33 columns

Conclusion 9b: This graph shows that 2 pairs of teams seem to perform equally well, when each team within each pair, plays against the other team, Malaga vs Bilbao for a total of 8 ties and Espanyol vs Villarreal for a total of 7 ties.

Final question:

Question 10: Is there such a thing a home advantage?

Answer 10: Average goals scored home vs away

In [44]:
spain_liga_bbva_stats = spain_liga_bbva_df.describe()
spain_liga_bbva_stats['Home_Advantage'] = spain_liga_bbva_stats['Team_Home_Goal'] / spain_liga_bbva_stats['Team_Away_Goal']
In [45]:
spain_liga_bbva_stats
Out[45]:
Team_Home_Goal Team_Away_Goal Team_Home_Win Team_Home_Tie Team_Away_Win Home_Advantage
count 3040.000000 3040.000000 3040.000000 3040.000000 3040.000000 1.000000
mean 1.631250 1.135855 0.488487 0.231579 0.279934 1.436142
std 1.388339 1.161079 0.499950 0.421911 0.449041 1.195732
min 0.000000 0.000000 0.000000 0.000000 0.000000 NaN
25% 1.000000 0.000000 0.000000 0.000000 0.000000 inf
50% 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000
75% 2.000000 2.000000 1.000000 0.000000 1.000000 1.000000
max 10.000000 8.000000 1.000000 1.000000 1.000000 1.250000

Looking at the Home_Advantage column above we see that on average, a team scores 1.4 more goals when playing home compared to goals scored away. So there is a home advantage after all!

Possible future work: We explored several questions and we saw first hand the power and limitation of each visualization we used.

In the near future we could drill down into each team, especially the top 2 performing teams, and see what player attributes contributed the most in their wins or which players had the most penalties or what team attributes lead to the most victories?

Try it on your own! Feel free to take a deep dive into another Country and League

Simply go to cell 8 above where we defined our query to our local sqlite database and change the Country and League name filters to your own favorites.

...
Country = 'Spain' AND
League.name = 'Spain LIGA BBVA'
...

Everything until this point will work regardless of which Country and League you selected.

The remaining short part applies only to the Spanish BBVA League. As a reminder, you can adopt it to your own rival teams within your favorite League.

El Clásico

El Clásico is the name given in football to any match between fierce rivals FC Barcelona and Real Madrid.

https://en.wikipedia.org/wiki/El_Clásico

All El Clásico matches in our dataset

In [46]:
FCB = 'FC Barcelona'
RMCF = 'Real Madrid CF'
spain_liga_bbva_df[((spain_liga_bbva_df['Team_Home'] == FCB) & (spain_liga_bbva_df['Team_Away'] == RMCF)) \
                  | ((spain_liga_bbva_df['Team_Home'] == RMCF) & (spain_liga_bbva_df['Team_Away'] == FCB))]
Out[46]:
Country League Season Date Team_Home Team_Away Team_Home_Goal Team_Away_Goal Team_Home_Win Team_Home_Tie Team_Away_Win
140 Spain Spain LIGA BBVA 2008/2009 2008-12-13 FC Barcelona Real Madrid CF 2 0 1 0 0
331 Spain Spain LIGA BBVA 2008/2009 2009-05-02 Real Madrid CF FC Barcelona 2 6 0 0 1
497 Spain Spain LIGA BBVA 2009/2010 2009-11-29 FC Barcelona Real Madrid CF 1 0 1 0 0
684 Spain Spain LIGA BBVA 2009/2010 2010-04-10 Real Madrid CF FC Barcelona 0 2 0 0 1
889 Spain Spain LIGA BBVA 2010/2011 2010-11-29 FC Barcelona Real Madrid CF 5 0 1 0 0
1072 Spain Spain LIGA BBVA 2010/2011 2011-04-16 Real Madrid CF FC Barcelona 1 1 0 1 0
1283 Spain Spain LIGA BBVA 2011/2012 2011-12-10 Real Madrid CF FC Barcelona 1 3 0 0 1
1470 Spain Spain LIGA BBVA 2011/2012 2012-04-21 FC Barcelona Real Madrid CF 1 2 0 0 1
1589 Spain Spain LIGA BBVA 2012/2013 2012-10-07 FC Barcelona Real Madrid CF 2 2 0 1 0
1773 Spain Spain LIGA BBVA 2012/2013 2013-03-02 Real Madrid CF FC Barcelona 2 1 1 0 0
1994 Spain Spain LIGA BBVA 2013/2014 2013-10-26 FC Barcelona Real Madrid CF 2 1 1 0 0
2187 Spain Spain LIGA BBVA 2013/2014 2014-03-23 Real Madrid CF FC Barcelona 3 4 0 0 1
2364 Spain Spain LIGA BBVA 2014/2015 2014-10-25 Real Madrid CF FC Barcelona 3 1 1 0 0
2559 Spain Spain LIGA BBVA 2014/2015 2015-03-22 FC Barcelona Real Madrid CF 2 1 1 0 0
2770 Spain Spain LIGA BBVA 2015/2016 2015-11-21 Real Madrid CF FC Barcelona 0 4 0 0 1
2961 Spain Spain LIGA BBVA 2015/2016 2016-04-02 FC Barcelona Real Madrid CF 1 2 0 0 1

Match outcomes in goals when 'FC Barcelona' played at home (see corresponding graph and questions 7a, 8a above)

In [47]:
spain_liga_bbva_df[((spain_liga_bbva_df['Team_Home'] == FCB) & (spain_liga_bbva_df['Team_Away'] == RMCF)) ][['Team_Home_Goal','Team_Away_Goal']].sum()
Out[47]:
Team_Home_Goal    16
Team_Away_Goal     8
dtype: int64

Match outcomes in goals when 'Real Madrid CF' played at home (see corresponding graph and questions 7b, 8b above)

In [48]:
spain_liga_bbva_df[((spain_liga_bbva_df['Team_Home'] == RMCF) & (spain_liga_bbva_df['Team_Away'] == FCB)) ][['Team_Home_Goal','Team_Away_Goal']].sum()
Out[48]:
Team_Home_Goal    12
Team_Away_Goal    22
dtype: int64

Match outcomes in number of wins of 'FC Barcelona' over 'Real Madrid CF' (see corresponding graph and questions 9a above)

In [49]:
team_match_win_sum.loc[(FCB, RMCF)]
Out[49]:
10.0

Match outcomes in number of wins of 'Real Madrid CF' over 'FC Barcelona' (see corresponding graph and questions 9a above)

In [50]:
team_match_win_sum.loc[(RMCF, FCB)]
Out[50]:
4.0

Match outcomes in number of ties of the pair 'FC Barcelona', 'Real Madrid CF' (see corresponding graph and questions 9b above)

In [51]:
team_match_tie_sum.loc[(FCB, RMCF)] #same as team_match_tie_sum.loc[(RMCF, FCB)] since matrix team_match_tie_sum is symmetric
Out[51]:
2.0

Additional Material - Appendix

Matplotlib colormap reference

https://matplotlib.org/examples/color/colormaps_reference.html

Personal favorites for displaying positive impact: 'YlGnBu', 'cool'

Personal favorites for displaying negative impact: 'RdPu', 'magma_r', 'plasma_r'

In [52]:
#plot_colors = ['Accent', 'Accent_r', 'Blues', 'Blues_r', 'BrBG', 'BrBG_r', 'BuGn', 'BuGn_r', 'BuPu', 'BuPu_r', 'CMRmap', 'CMRmap_r', 'Dark2', 'Dark2_r', 'GnBu', 'GnBu_r', 'Greens', 'Greens_r', 'Greys', 'Greys_r', 'OrRd', 'OrRd_r', 'Oranges', 'Oranges_r', 'PRGn', 'PRGn_r', 'Paired', 'Paired_r', 'Pastel1', 'Pastel1_r', 'Pastel2', 'Pastel2_r', 'PiYG', 'PiYG_r', 'PuBu', 'PuBuGn', 'PuBuGn_r', 'PuBu_r', 'PuOr', 'PuOr_r', 'PuRd', 'PuRd_r', 'Purples', 'Purples_r', 'RdBu', 'RdBu_r', 'RdGy', 'RdGy_r', 'RdPu', 'RdPu_r', 'RdYlBu', 'RdYlBu_r', 'RdYlGn', 'RdYlGn_r', 'Reds', 'Reds_r', 'Set1', 'Set1_r', 'Set2', 'Set2_r', 'Set3', 'Set3_r', 'Spectral', 'Spectral_r', 'Vega10', 'Vega10_r', 'Vega20', 'Vega20_r', 'Vega20b', 'Vega20b_r', 'Vega20c', 'Vega20c_r', 'Wistia', 'Wistia_r', 'YlGn', 'YlGnBu', 'YlGnBu_r', 'YlGn_r', 'YlOrBr', 'YlOrBr_r', 'YlOrRd', 'YlOrRd_r', 'afmhot', 'afmhot_r', 'autumn', 'autumn_r', 'binary', 'binary_r', 'bone', 'bone_r', 'brg', 'brg_r', 'bwr', 'bwr_r', 'cool', 'cool_r', 'coolwarm', 'coolwarm_r', 'copper', 'copper_r', 'cubehelix', 'cubehelix_r', 'flag', 'flag_r', 'gist_earth', 'gist_earth_r', 'gist_gray', 'gist_gray_r', 'gist_heat', 'gist_heat_r', 'gist_ncar', 'gist_ncar_r', 'gist_rainbow', 'gist_rainbow_r', 'gist_stern', 'gist_stern_r', 'gist_yarg', 'gist_yarg_r', 'gnuplot', 'gnuplot2', 'gnuplot2_r', 'gnuplot_r', 'gray', 'gray_r', 'hot', 'hot_r', 'hsv', 'hsv_r', 'icefire', 'icefire_r', 'inferno', 'inferno_r', 'jet', 'jet_r', 'magma', 'magma_r', 'mako', 'mako_r', 'nipy_spectral', 'nipy_spectral_r', 'ocean', 'ocean_r', 'pink', 'pink_r', 'plasma', 'plasma_r', 'prism', 'prism_r', 'rainbow', 'rainbow_r', 'rocket', 'rocket_r', 'seismic', 'seismic_r', 'spectral', 'spectral_r', 'spring', 'spring_r', 'summer', 'summer_r', 'tab10', 'tab10_r', 'tab20', 'tab20_r', 'tab20b', 'tab20b_r', 'tab20c', 'tab20c_r', 'terrain', 'terrain_r', 'viridis', 'viridis_r', 'vlag', 'vlag_r', 'winter', 'winter_r']
plot_colors = ['YlGnBu', 'cool', 'RdPu', 'magma_r', 'plasma_r']

for plot_color in plot_colors:
    season_team_home_goal_sum = reset_index_pivot_plot_series('heat_map', plot_color, season_team_home_goal_sum, \
    'Team_Home', 'Season', 'Team_Home_Goal', plot_color + ' - Home goals scored per Season', 'Team', 'Home Goals Scored', False, False)